EDA of Mental Health Dataset

This notebook presents the exploratory data analysis process of the Mental Health Dataset through visualization.

The notebook utilizes code from MH_EDA.py and GeoBound_ChoroplethMap.py. These files can be found in the same folder as the notebook.

In [1]:
import MH_EDA as mh
import GeoBound_ChoroplethMap as gb
from IPython.display import Image

Content

  • Cleaning Mental Health Dataset
  • Explore MH Overall Status
  • Regional and Divisional MH Prevalence
  • Population Influence

Cleaning Mental Health Dataset

First, we load the mental health dataset, clean it, and remove the less relevant features.

In [2]:
mh_file_path = 'MHDS/Original/500_Cities__City-level_Data__GIS_Friendly_Format___2017_release_20240514.csv'

raw_df = mh.mh_load_file(mh_file_path)
raw_df.head()
Out[2]:
StateAbbr PlaceName PlaceFIPS Population2010 ACCESS2_CrudePrev ACCESS2_Crude95CI ACCESS2_AdjPrev ACCESS2_Adj95CI ARTHRITIS_CrudePrev ARTHRITIS_Crude95CI ... SLEEP_Adj95CI STROKE_CrudePrev STROKE_Crude95CI STROKE_AdjPrev STROKE_Adj95CI TEETHLOST_CrudePrev TEETHLOST_Crude95CI TEETHLOST_AdjPrev TEETHLOST_Adj95CI Geolocation
0 AL Birmingham 107000 212237 19.6 (19.2, 20.0) 19.8 (19.5, 20.2) 30.9 (30.8, 31.1) ... (46.6, 47.0) 5.2 ( 5.1, 5.3) 5.2 ( 5.1, 5.2) 26.1 (25.1, 27.2) 25.9 (25.0, 26.9) (33.52756637730, -86.7988174678)
1 AL Hoover 135896 81619 9.7 ( 9.3, 10.1) 9.9 ( 9.5, 10.4) 25.3 (25.0, 25.7) ... (34.2, 35.0) 2.2 ( 2.1, 2.3) 2.2 ( 2.1, 2.2) 9.6 ( 8.6, 10.8) 9.5 ( 8.5, 10.9) (33.37676027290, -86.8051937568)
2 AL Huntsville 137000 180105 15.1 (14.7, 15.4) 15.1 (14.8, 15.5) 27.5 (27.3, 27.7) ... (39.4, 40.0) 3.4 ( 3.3, 3.4) 3.3 ( 3.2, 3.3) 14.9 (14.1, 15.7) 14.7 (13.8, 15.5) (34.69896926710, -86.6387042882)
3 AL Mobile 150000 195111 16.9 (16.6, 17.2) 17.2 (16.9, 17.5) 30.5 (30.3, 30.6) ... (42.0, 42.4) 4.4 ( 4.3, 4.5) 4.1 ( 4.1, 4.2) 24.3 (23.4, 25.3) 24.1 (23.1, 25.0) (30.67762486480, -88.1184482714)
4 AL Montgomery 151000 205764 17.4 (17.0, 17.9) 17.5 (17.1, 17.9) 29.8 (29.7, 30.0) ... (41.0, 41.5) 4.1 ( 4.1, 4.2) 4.2 ( 4.1, 4.3) 21.2 (20.3, 22.2) 21.2 (20.1, 22.2) (32.34726453330, -86.2677059552)

5 rows × 117 columns

This project is not concerned with the other chronic diseases in the dataset, so I will drop their columns and retain only the mental health measures along with the essential identifying features.

In [3]:
mh_df = mh.mh_remove_chronics(raw_df)
mh_df.head()
Out[3]:
StateAbbr PlaceName PlaceFIPS Population2010 MHLTH_CrudePrev MHLTH_Crude95CI MHLTH_AdjPrev MHLTH_Adj95CI Geolocation
0 AL Birmingham 107000 212237 15.6 (15.4, 15.8) 15.6 (15.4, 15.8) (33.52756637730, -86.7988174678)
1 AL Hoover 135896 81619 10.4 (10.1, 10.7) 10.4 (10.1, 10.7) (33.37676027290, -86.8051937568)
2 AL Huntsville 137000 180105 13.3 (13.1, 13.6) 13.4 (13.2, 13.7) (34.69896926710, -86.6387042882)
3 AL Mobile 150000 195111 14.9 (14.7, 15.1) 15.0 (14.9, 15.2) (30.67762486480, -88.1184482714)
4 AL Montgomery 151000 205764 14.9 (14.7, 15.2) 14.8 (14.6, 15.1) (32.34726453330, -86.2677059552)
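The body of mh_remove_chronics is not shown in this notebook. As a minimal sketch of the idea, assuming the mental health measures all share the MHLTH_ prefix (as the output above suggests), the column filter could look like the following. The helper name select_mental_health_columns and the two constants are hypothetical and are not taken from MH_EDA.py.

```python
# Hypothetical sketch: NOT the actual code in MH_EDA.py.
ID_COLUMNS = ["StateAbbr", "PlaceName", "PlaceFIPS", "Population2010", "Geolocation"]
MEASURE_PREFIX = "MHLTH_"  # assumed prefix of the mental health measures


def select_mental_health_columns(columns):
    """Return the identifier columns plus every MHLTH_* column, in original order."""
    keep = set(ID_COLUMNS)
    return [c for c in columns if c in keep or c.startswith(MEASURE_PREFIX)]


# A handful of column names that appear in the tables above.
raw_columns = [
    "StateAbbr", "PlaceName", "PlaceFIPS", "Population2010",
    "ACCESS2_CrudePrev", "ACCESS2_Crude95CI",
    "MHLTH_CrudePrev", "MHLTH_Crude95CI", "MHLTH_AdjPrev", "MHLTH_Adj95CI",
    "STROKE_CrudePrev", "Geolocation",
]
kept = select_mental_health_columns(raw_columns)
print(kept)
```

With a pandas DataFrame, the same list could be used for column selection, for example raw_df[select_mental_health_columns(raw_df.columns)].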

The dataset has been refined by excluding other chronic diseases, resulting in a dataframe focused on mental health. The remaining features are explained in the table below:

Features | Type | Meaning
StateAbbr | Plain Text | State abbreviation
PlaceName | Plain Text | City name
PlaceFIPS | Number | City FIPS Code
Population2010 | Number | 2010 Census population count
MHLTH_CrudePrev | Number | Crude prevalence of poor mental health for 14 days or more among adults aged 18 years and older, 2015. Crude prevalence represents the ratio of the total number of responses of 'not good' to the total number of valid responses (excluding those who refused to answer, provided no response, or indicated 'don’t know/not sure').
MHLTH_Crude95CI | Plain Text | Estimated 95% confidence interval for crude prevalence
MHLTH_AdjPrev | Number | Age-adjusted prevalence, standardized by the direct method to the year 2000 standard U.S. population, distribution 9. [1]
MHLTH_Adj95CI | Plain Text | Estimated 95% confidence interval for age-adjusted prevalence
Geolocation | Plain Text | Latitude, longitude of city centroid

Further cleaning and manipulation will be necessary as some features are less useful or stored in an incorrect format:

Removing Features:

  • PlaceFIPS: We will use PlaceName as the primary feature and StateAbbr as the secondary feature to align with the environmental dataset. The FIPS code is less useful for this project and will therefore be removed.
  • MHLTH_CrudePrev, MHLTH_Crude95CI: We will use the age-adjusted prevalence instead, because it is standardized to a common age distribution and is therefore comparable across cities with different age structures.

Transforming Format:

  • Geolocation: Geolocation needs to be converted into a list of two floats representing latitude and longitude.

[1] The direct method adjusts for differences in age structure by weighting age-specific rates with a fixed set of age-group weights, here those of the year 2000 standard U.S. population, distribution 9. The Department of Health and Human Services (DHHS) mandates this method across all its agencies to make age-adjusted rates comparable among data systems (reference). "Distribution 9" means that the age-adjusted prevalence uses the weighting factors provided by that distribution. For more information about the weights, see page 3.
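To make the direct method concrete, here is a small worked sketch. The three age groups, their rates, and all weights are made-up illustrative numbers; they are not the Distribution 9 weights and not values from the dataset. The function direct_age_adjust is a hypothetical helper.

```python
def direct_age_adjust(age_specific_rates, weights):
    """Weighted average of age-specific rates; weights are normalized to sum to 1."""
    if len(age_specific_rates) != len(weights):
        raise ValueError("rates and weights must have the same length")
    total = sum(weights)
    return sum(r * w for r, w in zip(age_specific_rates, weights)) / total


# ILLUSTRATIVE numbers only (three age groups: young, middle, old).
rates = [16.0, 12.0, 8.0]            # same age-specific prevalence (%) in both cities
standard_weights = [0.4, 0.4, 0.2]   # hypothetical standard population shares

city_a_shares = [0.5, 0.3, 0.2]      # younger city
city_b_shares = [0.2, 0.3, 0.5]      # older city

# Crude prevalence weights each age group by the city's own age structure.
crude_a = direct_age_adjust(rates, city_a_shares)
crude_b = direct_age_adjust(rates, city_b_shares)

# Age-adjusted prevalence weights every city by the same standard population.
adjusted_a = direct_age_adjust(rates, standard_weights)
adjusted_b = direct_age_adjust(rates, standard_weights)

print(round(crude_a, 2), round(crude_b, 2))
print(round(adjusted_a, 2), round(adjusted_b, 2))
```

In this toy case the crude values differ only because the two cities have different age structures, while the adjusted values coincide. This is the property that motivates using MHLTH_AdjPrev for comparisons between cities.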

In [4]:
mhdf = mh.mh_secondary_remove_and_transform(mh_df)
mhdf.head()
Out[4]:
StateAbbr PlaceName Population2010 MHLTH_AdjPrev MHLTH_Adj95CI Geolocation
0 AL Birmingham 212237 15.6 (15.4, 15.8) [33.5275663773, -86.7988174678]
1 AL Hoover 81619 10.4 (10.1, 10.7) [33.3767602729, -86.8051937568]
2 AL Huntsville 180105 13.4 (13.2, 13.7) [34.6989692671, -86.6387042882]
3 AL Mobile 195111 15.0 (14.9, 15.2) [30.6776248648, -88.1184482714]
4 AL Montgomery 205764 14.8 (14.6, 15.1) [32.3472645333, -86.2677059552]
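The Geolocation conversion inside mh_secondary_remove_and_transform is not shown here. One possible way to turn the "(lat, lon)" text into a list of two floats, using only the standard library, is sketched below; parse_geolocation is a hypothetical name.

```python
import ast


def parse_geolocation(text):
    """Convert a string such as '(33.5, -86.8)' into [latitude, longitude]."""
    lat, lon = ast.literal_eval(text.strip())  # safely evaluates the tuple literal
    return [float(lat), float(lon)]


# Value copied from the Birmingham row of the raw table.
birmingham = parse_geolocation("(33.52756637730, -86.7988174678)")
print(birmingham)
```

With pandas, a function like this could be applied to the column with mh_df['Geolocation'].apply(parse_geolocation).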

Explore MH Overall Status

I chose a treemap to present the overall status of all 500 cities instead of a bar chart because it effectively utilizes size and color to clearly depict the mental health prevalence in each city. Imagine trying to read data from a bar chart with 500 bars!

Plotly is an interactive visualization library: hovering over or clicking on a box in the treemap reveals detailed information about it.

Unfortunately, the interactive features are not available when the notebook is viewed on GitHub, so the figures below are static images that give an overview. To explore the interactive versions, download the visualizations using the following links and open them in any web browser:

  • fig_city
  • fig_statecity
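The export helper called in the cells below (mh.output_visuals) is defined in MH_EDA.py and is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it relies on Plotly's Figure.write_html and Figure.write_image methods, is given here under the hypothetical name save_figure. Static image export in Plotly additionally requires the kaleido package.

```python
def save_figure(fig, path, tohtml=False):
    """Write a Plotly figure to `path`.

    tohtml=True  -> standalone HTML page that keeps hover and click interactions
    tohtml=False -> static image (format inferred from the file extension)
    """
    if tohtml:
        fig.write_html(path)
    else:
        fig.write_image(path)  # needs the kaleido package installed
    return path
```

The HTML file is the one to download for interactive exploration, while the PNG is what a static viewer such as GitHub can display.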
In [5]:
fig_city = mh.mh_plotly_treemap(mhdf, city_level=True, title='Mental Health Prevalence by City')

# output png and html files
mh.output_visuals(fig_city, 'MHDS/Visuals/fig_city.png')
mh.output_visuals(fig_city, 'MHDS/Visuals/fig_city.html', tohtml = True)

# fig_city.show()

# static fallback for GitHub; when running locally, comment this out and use fig_city.show() instead:
Image(filename='MHDS/Visuals/fig_city.png')
Out[5]:

According to the treemap above, the five cities with the highest prevalence of poor mental health are:

  • New Bedford, MA
  • Fall River, MA
  • Springfield, MA
  • Flint, MI
  • Reading, PA

Among these five cities, three are located in Massachusetts (MA). This observation raises the question of whether Massachusetts has the most severe mental health issues among all U.S. states.

To explore this, I will use a treemap to present state-level data. Treemaps are effective for displaying hierarchical data and illustrating the status of states along with the relationship between cities and their parent state. By clicking on a state box in the treemap, users can view the average prevalence of mental health issues. This interactive approach can help identify states facing severe mental health challenges.

In [6]:
fig_statecity = mh.mh_plotly_treemap(mhdf)

# output png and html files
mh.output_visuals(fig_statecity, 'MHDS/Visuals/fig_statecity.png')
mh.output_visuals(fig_statecity, 'MHDS/Visuals/fig_statecity.html', tohtml = True)

# fig_statecity.show()

# static fallback for GitHub; when running locally, comment this out and use fig_statecity.show() instead:
Image(filename='MHDS/Visuals/fig_statecity.png')
Out[6]:

According to the treemap, Ohio (OH) has the highest state-level average prevalence at 15.37%, and Massachusetts (MA) ranks second at 15.06%.

Despite this, Massachusetts has a higher number of cities with significant mental health challenges. Three out of 13 cities have a prevalence over 17, whereas Ohio has only one such city. However, the inclusion of more cities in Massachusetts's sample, some with lower prevalences like Newton (9.2), reduces the state's average. This variation in city selection can introduce significant bias if we analyze at a geographic level larger than the city.
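The effect described above can be reproduced with a toy example. Apart from the 9.2 quoted for Newton, every number below is hypothetical and chosen only to illustrate how the set of sampled cities changes an unweighted state average.

```python
from statistics import mean


def count_above(values, threshold):
    """Number of cities whose prevalence exceeds the threshold."""
    return sum(1 for v in values if v > threshold)


# HYPOTHETICAL city-level prevalences (%); only 9.2 comes from the text above.
state_many_cities = [17.6, 17.4, 17.2, 15.0, 14.5, 13.0, 9.2]
state_few_cities = [17.3, 15.5, 15.0, 14.8]

for name, cities in [("many", state_many_cities), ("few", state_few_cities)]:
    print(name, len(cities), count_above(cities, 17), round(mean(cities), 2))
```

The state with more sampled cities has three cities above 17 but the lower average, because its sample also includes low-prevalence cities. This is why a simple mean of city values should be read with the sample composition in mind.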

These variations raise a further question: does mental health prevalence differ across larger geographic units, such as Census regions and divisions?

Regional and Divisional MH Prevalence

How would geographic locations (regions and divisions) affect mental health prevalence?

Inference: Environmental and socioeconomic statuses vary significantly among regions and divisions. Intuitively, locations with less green space and lower socioeconomic status might exhibit a higher prevalence of poor mental health. Thus, I expect central areas of the US to have more severe mental health issues. However, the results might also be influenced by the data collection method, for example if fewer cities are sampled from central areas.

Regardless, let's dive into the data. I will explore how mental health prevalence varies among regions and divisions using the choropleth map provided by the Folium package, which offers interactive functionalities, making the map more informative.

Both regional and divisional boundary data can be found and downloaded here.

In [7]:
# prepare the regional average df 
# import us_region dictionary
us_region = gb.us_region()

# apply labels from dictionary to mhdf and output a grouped df by regions
mh_regions_avg = mh.mh_apply_boundary(mhdf, 'Regions', us_region)

mh_regions_avg
Out[7]:
Regions Population2010 MHLTH_AdjPrev
0 Midwest 185709.967742 12.143011
1 Northeast 301909.553571 14.125000
2 South 203285.878205 12.705128
3 West 190411.533333 11.875897
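The helper mh_apply_boundary is not shown in this notebook. Assuming it labels each city with the Census region of its state and then averages within each region, the idea can be sketched with the standard library as follows. The state lists are the four U.S. Census Bureau regions; the function name regional_average is hypothetical, and the sample rows are the six cities quoted earlier in this notebook, so the result is not expected to match Out[7].

```python
from collections import defaultdict
from statistics import mean

# U.S. Census Bureau regions (50 states plus DC).
CENSUS_REGIONS = {
    "Northeast": ["CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA"],
    "Midwest": ["IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD"],
    "South": ["DE", "DC", "FL", "GA", "MD", "NC", "SC", "VA", "WV",
              "AL", "KY", "MS", "TN", "AR", "LA", "OK", "TX"],
    "West": ["AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR", "WA"],
}
STATE_TO_REGION = {s: region for region, states in CENSUS_REGIONS.items() for s in states}


def regional_average(rows):
    """rows: iterable of (state_abbr, prevalence). Returns {region: mean prevalence}."""
    grouped = defaultdict(list)
    for state, prevalence in rows:
        grouped[STATE_TO_REGION[state]].append(prevalence)
    return {region: mean(values) for region, values in grouped.items()}


# The five Alabama rows from Out[4] plus Newton, MA (9.2) mentioned above.
sample_rows = [("AL", 15.6), ("AL", 10.4), ("AL", 13.4), ("AL", 15.0), ("AL", 14.8), ("MA", 9.2)]
sample_avg = regional_average(sample_rows)
print({region: round(value, 2) for region, value in sample_avg.items()})
```

Note that this is an unweighted mean over cities, so a region's value depends on which cities happen to be in the dataset.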
In [8]:
# output the GeoJSON file of region boundaries if it does not already exist
regions_path = 'MHDS/Original/cb_2018_us_region_500k/cb_2018_us_region_500k.shp'
regional_bound = gb.bound_load_file_output_geojson(regions_path, full_state = True, output = True, output_folder = 'MHDS/', output_filename ='region_gdf.geojson')
regional_bound.head()
Be aware of large dataset!
File already exists.
Out[8]:
REGIONCE AFFGEOID GEOID NAME LSAD ALAND AWATER geometry
0 1 0200000US1 1 Northeast 68 419357835545 50259300137 MULTIPOLYGON (((-68.27472 44.25867, -68.27144 ...
1 2 0200000US2 2 Midwest 68 1943997274253 184273267512 MULTIPOLYGON (((-82.73571 41.60336, -82.73392 ...
2 4 0200000US4 4 West 68 4536201747682 316587292459 MULTIPOLYGON (((179.48246 51.98283, 179.48656 ...
3 3 0200000US3 3 South 68 2249871668369 134084610547 MULTIPOLYGON (((-75.56555 39.51485, -75.56174 ...
In [9]:
# create regional choropleth map
m_regions = gb.choropleth_map('MHDS/region_gdf.geojson', mh_regions_avg)
display(m_regions)

The map shows that the Northeast region has the highest average prevalence (about 14.1%, compared with 11.9% to 12.7% in the other three regions), but four regions are too coarse to be very informative. Therefore, we will explore further based on the nine divisions of the U.S.

In [10]:
# same as above, create divisional average df and boundary file
us_division = gb.us_division()
mh_division_avg = mh.mh_apply_boundary(mhdf, 'Divisions', us_division)
display(mh_division_avg.head())

division_path = 'MHDS/Original/cb_2018_us_division_500k/cb_2018_us_division_500k.shp'
divisional_bound = gb.bound_load_file_output_geojson(division_path, full_state = True, output = True, output_folder = 'MHDS/', output_filename ='division_gdf.geojson')
Divisions Population2010 MHLTH_AdjPrev
0 East North Central 198099.590164 12.727869
1 East South Central 246005.875000 14.425000
2 Middle Atlantic 510628.560000 14.356000
3 Mountain 202059.040000 11.460000
4 New England 118945.137931 13.941379
Be aware of large dataset!
File already exists.
In [11]:
# create the divisional choropleth map
m = gb.choropleth_map('MHDS/division_gdf.geojson', mh_division_avg, geo_col=['Divisions','MHLTH_AdjPrev'])

display(m)

From the map above, we can see that three divisions appear to have more severe mental health issues than other divisions: East South Central (14.42%), Middle Atlantic (14.36%), and New England (13.94%).

In fact, the entire Eastern region seems to experience more severe mental health issues compared to the Central and Western regions. Given that the Eastern area has a distinct environment and socioeconomic status compared to the Central and Western regions, this distinction provides a valuable starting point to further explore how environmental and socioeconomic factors correlate with mental health prevalence.

Population Influence

Does the size of a population affect mental health (MH) prevalence?

Inference: Population size, used here as a proxy for city size, can influence mental health in two ways:

  • Accessibility to Green Spaces: Larger cities, having denser populations, typically offer less access to green spaces, which may increase MH prevalence.
  • Socioeconomic Status: Conversely, large cities often have higher socioeconomic status, which may reduce MH prevalence.

The relationship is complex, so let’s explore it using the dataset mhdf.

First, we need to sort cities into different size groups based on their population. Following the OECD Classification, we can categorize cities into four groups under a new column CitySize:

  • Small Urban Areas: 50,000 to 200,000 people.
  • Medium-Size Urban Areas: 200,000 to 500,000 people.
  • Metropolitan Areas: 500,000 to 1.5 million people.
  • Large Metropolitan Areas: 1.5 million or more people.

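We will then apply this classification to mhdf and create a new DataFrame, df_CitySize, that holds the number of cities and the average MH prevalence for each city-size group (using groupby on CitySize). In the grouped output, the Population2010 column contains the number of cities in each group rather than a population count. We also add a squared MH prevalence column (square_MHLTH_AdjPrev), the square of the group average, to make the small differences between groups easier to see in the chart.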
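The classification and aggregation steps described above can be sketched without pandas as follows. This is not the code of mh_apply_CitySize, which is not shown here. The function names are hypothetical, and treating each range as half-open (lower bound included, upper bound excluded) is an assumption, since the boundaries in the list overlap at 200,000 and 500,000.

```python
from collections import defaultdict
from statistics import mean

# (label, lower bound inclusive, upper bound exclusive): ASSUMED boundary handling.
CITY_SIZE_BOUNDS = [
    ("Small Urban Areas", 50_000, 200_000),
    ("Medium-Size Urban Areas", 200_000, 500_000),
    ("Metropolitan Areas", 500_000, 1_500_000),
    ("Large Metropolitan Areas", 1_500_000, float("inf")),
]


def classify_city_size(population):
    """Return the city-size label, or None for populations below 50,000."""
    for label, low, high in CITY_SIZE_BOUNDS:
        if low <= population < high:
            return label
    return None


def summarize_by_city_size(rows):
    """rows: iterable of (population, prevalence). Returns per-group count and averages."""
    grouped = defaultdict(list)
    for population, prevalence in rows:
        label = classify_city_size(population)
        if label is not None:
            grouped[label].append(prevalence)
    return {
        label: {
            "n_cities": len(values),
            "mean_prev": mean(values),
            "squared_mean_prev": mean(values) ** 2,
        }
        for label, values in grouped.items()
    }


# (Population2010, MHLTH_AdjPrev) for the five Alabama rows shown in Out[4].
alabama = [(212237, 15.6), (81619, 10.4), (180105, 13.4), (195111, 15.0), (205764, 14.8)]
summary = summarize_by_city_size(alabama)
for label, stats in summary.items():
    print(label, stats["n_cities"], round(stats["mean_prev"], 2))
```

Only five cities are used here, so the group averages are illustrative and are not comparable with the full-dataset figures.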

Finally, we will use Altair (a visualization package that allows for flexible customizations) to create a visualization of Population vs. MH Prev combining a bar chart (presenting the number of cities for each group) and a scatter plot (presenting the average MH prevalence) to analyze the influence of population size on MH prevalence.

In [12]:
city_size_dict = {
    'Small Urban Areas': [50000,200000],
    'Medium-Size Urban Areas': [200000,500000],
    'Metropolitan Areas': [500000,1500000],
    'Large Metropolitan Areas': [1500000, 100**100]
}

# call function to add CitySize col to mhdf and output a grouped df

df_CitySize = mh.mh_apply_CitySize(mhdf, city_size_dict)
df_CitySize
Out[12]:
CitySize Population2010 MHLTH_AdjPrev square_MHLTH_AdjPrev
0 Large Metropolitan Areas 5 12.880000 165.894400
1 Medium-Size Urban Areas 73 12.549315 157.485309
2 Metropolitan Areas 29 12.458621 155.217229
3 Small Urban Areas 392 12.410204 154.013165
In [13]:
# call function to present the Population vs. MH Prev
mh.mh_pop_vs_mh(df_CitySize)
Out[13]:

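Unfortunately, according to the chart above, population size appears to have little relationship with mental health prevalence: the group averages differ by less than half a percentage point (12.41% to 12.88%), and the group with the highest average contains only five cities.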

I think it would be better to delve deeper into accessibility to green spaces and socioeconomic status instead of focusing on population size.